External and Intrinsic Plagiarism Detection Using Vector Space Models
Abstract
Plagiarism detection can be divided into external and intrinsic methods. Naive external plagiarism analysis suffers from computationally demanding full nearest neighbor searches within a reference corpus. We present a conceptually simple space partitioning approach that achieves search times sublinear in the number of reference documents, trading precision for speed. We focus on full duplicate searches while achieving acceptable results in the near duplicate case. Intrinsic plagiarism analysis tries to find plagiarized passages within a document without any external knowledge. We construct a vector space model for each sentence of a suspicious document from several topic-independent stylometric features. Plagiarized passages are detected by an outlier analysis relative to the document mean vector. Our system was created for the first PAN competition on plagiarism detection in 2009. We evaluated it on the challenge's development and competition corpora and report our results for both.
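The intrinsic step described above (one stylometric vector per sentence, then an outlier analysis against the document mean vector) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the features, the distance measure, or the decision rule, so the feature set (`FUNCTION_WORDS`, sentence length, word length, punctuation rate), the standardized Euclidean distance, and the `z_threshold` parameter are all assumptions chosen for illustration.

```python
import math
import re
import statistics

# Assumed, illustrative feature set: a few topic-independent function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a", "is", "that")


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def style_vector(sentence):
    """Map one sentence to a small vector of stylometric features (assumed set)."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    n = max(len(words), 1)
    features = [
        float(len(words)),                          # sentence length in words
        sum(len(w) for w in words) / n,             # mean word length
        sum(sentence.count(c) for c in ",;:") / n,  # punctuation marks per word
    ]
    features.extend(words.count(fw) / n for fw in FUNCTION_WORDS)
    return features


def find_outliers(text, z_threshold=2.0):
    """Return indices of sentences whose style lies unusually far from the document mean."""
    vectors = [style_vector(s) for s in split_sentences(text)]
    if len(vectors) < 3:
        return []
    dims = range(len(vectors[0]))
    mean = [statistics.mean([v[d] for v in vectors]) for d in dims]
    # Constant features get a scale of 1.0 so they contribute zero distance.
    scale = [statistics.pstdev([v[d] for v in vectors]) or 1.0 for d in dims]
    # Standardized Euclidean distance of each sentence to the document mean vector.
    dists = [
        math.sqrt(sum(((v[d] - mean[d]) / scale[d]) ** 2 for d in dims))
        for v in vectors
    ]
    d_mean = statistics.mean(dists)
    d_std = statistics.pstdev(dists)
    if d_std == 0:
        return []
    return [i for i, d in enumerate(dists) if (d - d_mean) / d_std > z_threshold]
```

Flagged indices refer to sentence positions in the suspicious document; adjacent flagged sentences could then be merged into candidate passages. A lower `z_threshold` flags more sentences at the cost of more false positives.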
Similar resources
External & Intrinsic Plagiarism Detection: VSM & Discourse Markers based Approach - Notebook for PAN at CLEF 2011
This paper describes the performance of a plagiarism detection system that can detect both external and intrinsic plagiarism in text. It reports results on the PAN-PC-2011 test corpus. We investigated Vector Space Model based techniques for detecting external plagiarism cases and discourse-marker-based features for detecting intrinsic plagiarism cases.
Approaches for Intrinsic and External Plagiarism Detection - Notebook for PAN at CLEF 2011
Plagiarism detection has been considered as a classification problem which can be approximated with intrinsic strategies, considering self-based information from a given document, and external strategies, considering comparison techniques between a suspicious document and different sources. In this work, both intrinsic and external approaches for plagiarism detection are presented. First, the m...
Paragraph Clustering for Intrinsic Plagiarism Detection using a Stylistic Vector Space Model with Extrinsic Features
Our approach to the task of intrinsic plagiarism detection uses a vector space model which eschews surface features in favor of richer extrinsic features, including those based on latent semantic analysis in a larger external corpus. We posit that the popularity and success of surface n-gram features is mostly due to the topic-biased nature of current artificial evaluations, a problem which unfo...
Detection of Paraphrastic Cases of Mono-lingual and Cross-lingual Plagiarism
External plagiarism detection is a unique retrieval process in which the algorithm has to provide evidence of plagiarism, if any, for a suspicious section from the pool of available source documents. This paper focuses on the paraphrasing involved in plagiarism detection, from both monolingual and cross-lingual perspectives. In order to investigate the challenges in detection, we further analyse the perf...
External Plagiarism Detection
Here we describe our algorithm for detecting external plagiarism in the PAN-10 competition. The algorithm has two steps: (1) identifying similar documents and the plagiarized section for a suspicious document among the source documents, using a Vector Space Model (VSM) and the cosine similarity measure, and (2) identifying the plagiarized area in the suspicious document using the chunk ratio.